- IEN 104
-
-
-
- Minutes of the Fault Isolation Meeting
-
- 12 March 1979
-
-
- Virginia Strazisar
-
-
-
- Bolt, Beranek, and Newman
-
- 20 March 1979
-
-
- Minutes of the Fault Isolation Meeting held at BBN on March 12
-
- Attendees:
- Virginia Strazisar, BBN, chairman
- Peter Sevcik, BBN
- Dale McNeill, BBN
- Noel Chiappa, MIT
- Ray McFarland, DOD
- Mike Wingfield, BBN
- Jack Haverty, BBN
- Bill Plummer, BBN
- Mike Brescia, BBN
-
- Ginny suggested that there are three situations in which fault
- isolation is needed: 1) the user at a terminal on the catenet
- who cannot reach some destination on the catenet, 2) a catenet
- control center that must decide what network or gateway in the
- catenet has failed, and 3) the gateway implementor who must
- decide what part of the gateway hardware or software has failed.
- These situations were put forth as a framework for discussing the
- types of fault isolation facilities that we need. Ginny stated
- that the object of the meeting was to draw up a list of fault
- isolation tools needed, giving special consideration to what
- situations each of these tools would be used in and what
- questions they could be used to answer. From the suggestions
- drawn up at the meeting, the detailed formats and protocols could
- be designed; this level of design was specifically avoided at the
- meeting.
-
- The first situation discussed was the user at a catenet terminal,
- who discovers that he either cannot connect to a particular
- destination host or that he no longer gets any response from his
- previously working connection. At present no information is
- passed to the user in either of these cases. Everyone agreed
- that the user should receive some error reply. It was suggested
- that the user should receive a response indicating that either 1)
- the destination host is unreachable, 2) the local gateway or
- network is unreachable, or 3) the catenet is inoperative. Most
- people agreed that the naive user does not care to know what the
- catenet problems are in any more detail than this. For example,
- an error message of the form "Can't reach destination network
- because gateway 3 is down" would be totally useless to the naive
- user. The user also wants to know when the service will be
- restored, either "within a short time" such that the user is
- willing to wait for the service to be restored; or "not for a
- long time" such that the user will quit trying to use the service
- at this time. Several people pointed out that a more
- sophisticated user may want to know exactly what component of the
- catenet failed. There was some discussion as to whether users
- should be given access to tools that would enable them to probe
- the catenet gateways to determine where the failure occurred.
- The consensus of opinion was that the user should be given access
- to such tools, but that no user should be required to use such
- tools. Our model was that the naive user on receiving an error
- message would call a network or catenet control center, whereas
- the more sophisticated user may attempt to track down the problem
- before contacting the control center. We discussed in more
- detail what sort of message a gateway could return to the user.
- It was suggested that if the network returned an error message
- about a specific host, the text of that error message should be
- returned verbatim to the user. It was also suggested that error
- codes be defined for "common" failures, e.g., net down, host down,
- and that these be included in the error message. It was pointed
- out that the gateways currently return messages to the source
- host if they believe (based on their routing information) that
- the destination network is unreachable. These messages contain
- the source and destination addresses and the protocol field from
- the original datagram. Several people pointed out that this
- information is insufficient to return an error message to the
- source user and that the entire internet header of the original
- datagram should be returned in the error message. We discussed
- the problem of what to do in the case where datagrams are lost in
- a gateway or network in such a manner that no error message is
- generated and returned to the source. It was decided in this
- case that the source host should automatically probe the gateways
- in order to return a reasonable status message to the user. It
- was assumed that the user is running a program that implements
- some type of internet protocol, such as TCP, and that this
- program is capable of detecting long delays or multiple
- retransmissions and of generating some type of probe packet to
- attempt to track down the failure when this occurred. These
- probe packets are discussed in more detail below. Information
- obtained from such probing could also be sent to a monitoring
- center.
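-
- The error message contents suggested above can be sketched as a
- small data structure. This is only an illustration, not a format
- agreed at the meeting: the names FailureCode, GatewayErrorMessage
- and user_report, and the numeric code values, are assumptions made
- for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCode(Enum):
    # Hypothetical values; the minutes name only "net down" and
    # "host down" as examples of common failures.
    NET_DOWN = 1
    HOST_DOWN = 2
    OTHER = 255


@dataclass
class GatewayErrorMessage:
    code: FailureCode
    # Entire internet header of the original datagram, so the source
    # host can match the error to the right user.
    original_header: bytes
    # Error text from the network about a specific host, if any.
    network_text: Optional[str] = None


def user_report(msg: GatewayErrorMessage) -> str:
    """Build the text shown to the user, passing network text through verbatim."""
    if msg.network_text is not None:
        return f"{msg.code.name}: {msg.network_text}"
    return msg.code.name
```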
-
- We discussed the concept of a monitoring or control center. The
- primary purpose of a monitoring or control center in terms of
- fault isolation is to isolate the component (network or gateway)
- that failed and to notify the proper authority to have it fixed.
- We felt that a control center was needed to avoid having all the
- users in the catenet calling any and all implementors they felt
- might be responsible for problems. The concept of a single
- control center was discussed and rejected for both technical and
- political reasons. From the technical point of view, it was
- pointed out that the catenet could become partitioned such that
- the control center was cut off from part of the catenet and thus
- could no longer handle faults in that portion of the catenet. On
- the political side, it was pointed out that organizations
- responsible for the individual networks may be unwilling to
- support one control center run by one organization. We agreed
- that the catenet control center should actually be multiple
- control centers. These could be either the existing network
- control centers working in co-operation or separate catenet
- control centers, each of which was established by co-operating
- network groups. Tools that these control centers would need
- included a facility to probe gateways to determine why a
- particular destination was unreachable.
-
- We elaborated slightly on the design of a facility for probing
- gateways. A host or control center sends its local gateway a
- message saying "poll the gateways in the catenet to determine why
- I cannot get to destination X". The gateway then polls its
- neighbors, its neighbors' neighbors, etc., extracting routing
- tables, addresses of neighbor gateways, status of neighbor
- gateways and networks, etc. to determine why the destination is
- unreachable. The gateway would then formulate a response to the
- host; this response would be of the form: "the network
- connection between gateway 3 and net 2 is down", "gateway 5 and
- gateway 6 are down", etc. This mechanism would be an extension
- of the gateway-gateway protocol as defined in IEN #30. This
- probe facility would be used by the source host to generate a
- message to the user in the case where no response is received
- from the destination and no error message is returned by the
- gateways. The facility would also be used by catenet control
- centers to isolate the component of the catenet that has failed.
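-
- The polling of neighbors, then neighbors' neighbors, amounts to a
- breadth-first walk over the gateways. The sketch below is one way
- to picture it and is not the protocol itself: the function
- probe_catenet, the poll callable, and the shape of the report it
- returns are all assumptions made for the example.

```python
from collections import deque


def probe_catenet(local_gateway, poll):
    """Breadth-first poll of gateways, starting from the local one.

    `poll(g)` is a hypothetical callable that returns a dict with keys
    "neighbors" (list of gateway ids) and "nets" (dict mapping a net id
    to True if the connection is up), or None if gateway `g` does not
    answer.
    """
    findings = []
    seen = {local_gateway}
    queue = deque([local_gateway])
    while queue:
        gateway = queue.popleft()
        report = poll(gateway)
        if report is None:
            # No answer: report the gateway itself as the failure.
            findings.append(f"gateway {gateway} is down")
            continue
        for net, up in sorted(report["nets"].items()):
            if not up:
                findings.append(
                    f"the network connection between gateway {gateway} "
                    f"and net {net} is down"
                )
        for neighbor in report["neighbors"]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return findings
```

- The sketch reports every failure it finds; a real gateway would
- still have to decide which of them explains why destination X is
- unreachable.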
-
- It was pointed out that we should be concerned not only with
- total failures, but also with system performance, especially
- delay. In this context, we were not concerned with cases where
- delay seemed slightly longer than usual, but rather cases in
- which traffic crossed the catenet with extremely high delays,
- i.e., several minutes. A facility was suggested to track this sort
- of problem: generate a packet from source A addressed to
- destination B; have this packet trace its route and timestamp it
- at each gateway on the route to B; at B, echo the packet; return
- the packet to the source, A, using source routing and the route
- stored in the packet via the trace mechanism; timestamp the
- packet on its route back to A. The timestamps in the packet
- could now be interpreted to yield transit times across each
- network as there would be a pair of timestamps for each gateway
- traversed.
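-
- One possible reading of the timestamp pairs, assuming gateway
- clocks are not synchronized with one another: each gateway's two
- stamps come from its own clock, so their difference is the round
- trip time for everything beyond that gateway, and differencing
- adjacent gateways gives the round trip across the network between
- them. The function name and the tuple layout are assumptions made
- for the example.

```python
def round_trip_per_network(stamps):
    """Derive per-network round-trip times from a traced, echoed packet.

    `stamps` is a list of (gateway, t_out, t_back) tuples in outbound
    route order, where t_out and t_back are that gateway's own clock
    readings on the way to B and on the way back to A.
    """
    # Round trip for everything beyond each gateway, on its own clock.
    beyond = [(g, t_back - t_out) for g, t_out, t_back in stamps]
    result = []
    for (g1, rtt1), (g2, rtt2) in zip(beyond, beyond[1:]):
        # Time spent between g1 and g2, outbound plus return.
        result.append((g1, g2, rtt1 - rtt2))
    return result
```

- The last gateway's own difference covers the final network to B
- and the echo at B, so it is not split further here.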
-
- The final stage of fault isolation is the situation in which the
- failure has been attributed to a particular gateway and the
- implementor of that gateway must debug it. This part of fault
- isolation was not discussed in detail. It was suggested that at
- this point, it would be very useful to be able to turn off
- timeouts in the catenet to avoid having the state of the catenet
- change in such a way that the problem can no longer be isolated.
-
- In summary, the following list of tools and situations in which
- they would be used was suggested.
-
- 1) Error messages indicating whether the destination host, the
- local network or gateway, or the catenet had failed, and
- indicating the time at which service should be restored.
-
- These are to be returned automatically to the catenet user
- whenever there is a failure in using a catenet service.
-
- 2) Gateway to gateway probing mechanism that can be initiated
- with a host to gateway message.
-
- This mechanism would be used by a control center to isolate a
- component failure. It would also be available to the user. It
- would be used by source host protocol programs to formulate an
- error message for the user when no response was received from the
- destination and no error message was received from the gateways.
-
- 3) Ability to trace, echo and source route packets with
- timestamping.
-
- This facility would be used to determine where delays are
- occurring when a destination is reachable, but delays cannot be
- accounted for.
-
- 4) Ability to echo packets off any gateway.
- 5) Ability to trace packets.
- 6) Ability to source route packets.
- 7) Ability to dump gateway tables.
- 8) Ability to trace packets by sending replies from every
- gateway that handles the packet.
-
- These capabilities would be used by control centers and gateway
- implementors to isolate failed components and determine the
- reasons for failure. These facilities were not discussed in
- detail. A description of mechanisms for tracing packets and
- source routing packets was given in IEN #30, although these have
- not yet been implemented.
-
- The next step in developing fault isolation mechanisms for the
- catenet is to work out the detailed design for the mechanisms
- suggested above, and to implement these in hosts, gateways and
- control centers.
-
-